Index wiki database: design and experiments

Author

  • A. A. Krizhanovsky
Abstract

With the rapid growth of Internet usage, searching for information in documents of a special type called a “wiki page”, which is written in a simple markup language, has become an important problem. This paper describes the software architecture for indexing wiki texts in three languages (Russian, English, and German) and the interaction between the software components (GATE, Lemmatizer, and Synarcher). The inverted file index database was designed using the visual tool DBDesigner. The rules for parsing Wikipedia texts are illustrated by examples. Two index databases, of Russian Wikipedia (RW) and Simple English Wikipedia (SEW), were built and compared. RW is an order of magnitude larger than SEW (in number of words and lexemes), although the growth rate of the number of pages in SEW was found to be 14% higher than in RW, and the rate of acquisition of new words in the SEW lexicon was 7% higher, over a period of five months (from September 2007 to February 2008). Zipf's law was tested on both the Russian and Simple English Wikipedias. The entire source code of the indexing software and the generated index databases are freely available under the GPL (GNU General Public License).
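
As a rough illustration of the two ideas named in the abstract, the sketch below builds a tiny in-memory inverted index and estimates the slope of the rank-frequency curve that Zipf's law predicts to be close to -1 on large corpora. It is not the paper's implementation: the function names, the regular-expression tokenizer, and the toy pages are assumptions made for the example, and it skips lemmatization and the relational database that the paper describes.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_inverted_index(pages):
    """Map each word to {page title: occurrences of the word on that page}."""
    index = defaultdict(dict)
    for title, text in pages.items():
        for word, count in Counter(tokenize(text)).items():
            index[word][title] = count
    return index


def zipf_slope(index):
    """Least-squares slope of log(frequency) against log(rank)."""
    # Total corpus frequency of each word, most frequent first.
    freqs = sorted((sum(p.values()) for p in index.values()), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Hypothetical toy pages, not data from the paper.
pages = {
    "Wiki": "a wiki is a site that users edit",
    "Index": "an index maps a word to the pages that contain the word",
}
index = build_inverted_index(pages)
```

Looking up `index["word"]` returns the pages that contain the word together with per-page counts, which is the basic query an inverted file supports. A corpus this small says nothing about Zipf's law itself; the slope only becomes meaningful on a large collection such as a full Wikipedia dump.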


Similar resources

Using Links to prototype a Database Wiki

Both relational databases and wikis have strengths that make them attractive for use in collaborative applications. In the last decade, database-backed Web applications have been used extensively to develop valuable shared biological references called curated databases. Databases offer many advantages such as scalability, query optimization and concurrency control, but are not easy to use and l...

Information filtering based on wiki index database

In this paper we present a profile-based approach to information filtering by an analysis of the content of text documents. The Wikipedia index database is created and used to automatically generate the user profile from the user’s document collection. The problem-oriented Wikipedia subcorpora are created (using knowledge extracted from the user profile) for each topic of user interests. The in...

Forecasting S&P 500 index using artificial neural networks and design of experiments

The main objective of this research is to forecast the daily direction of Standard & Poor's 500 (S&P 500) index using an artificial neural network (ANN). In order to select the most influential features (factors) of the proposed ANN that affect the daily direction of S&P 500 (the response), design of experiments are conducted to determine the statistically significant factors among 27 potential...

Collaborative ORM Data Modeling: Educational Experience using a Wiki

This case study reports on a classroom experience using a Wiki to design an ORM data model. Student teams developed ORM diagrams and were to present them in a top down unfolding fashion along with an accompanying narrative description. Data modeling is typically a group effort involving several user domain subject matter experts. No one individual knows everything, but collaboration can capture...

The modENCODE Data Coordination Center: lessons in harvesting comprehensive experimental details

The model organism Encyclopedia of DNA Elements (modENCODE) project is a National Human Genome Research Institute (NHGRI) initiative designed to characterize the genomes of Drosophila melanogaster and Caenorhabditis elegans. A Data Coordination Center (DCC) was created to collect, store and catalog modENCODE data. An effective DCC must gather, organize and provide all primary, interpreted and a...

Journal:
  • CoRR

Volume: abs/0808.1753

Year of publication: 2008